Humanity’s Last Exam (HLE)

The hardest AI benchmark ever built: 2,500 expert-level questions designed to be the final closed-ended academic exam for AI

Published: August 20, 2025

Keywords: Humanity’s Last Exam, HLE, AI benchmark, frontier LLM evaluation, CAIS, Scale AI, expert-level questions, calibration error, MMLU saturation, multi-modal benchmark, LLM leaderboard

Introduction

AI benchmarks are critical for measuring LLM progress — but most of them are already saturated. Frontier models now score over 90% on popular benchmarks like MMLU and GPQA, making them ineffective at distinguishing between state-of-the-art models.

Humanity’s Last Exam (HLE) was created to address this. It is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind. Where other benchmarks have become routine for frontier LLMs, HLE remains brutally difficult — with even the best models scoring well below 50%.

“HLE tests structured academic problems rather than open-ended research or creative problem-solving abilities, making it a focused measure of technical knowledge and reasoning. HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.” — HLE Paper

graph LR
    A["Traditional Benchmarks<br/>(MMLU, GPQA, etc.)<br/>90%+ accuracy"] --> B["Benchmark<br/>Saturation"]
    B --> C["Humanity's Last Exam<br/>2,500 expert questions<br/>Best models < 45%"]
    C --> D["Meaningful signal<br/>for frontier AI"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is Humanity’s Last Exam?

Humanity’s Last Exam (HLE) is a multi-modal benchmark consisting of 2,500 questions across dozens of academic subjects — mathematics, humanities, natural sciences, and more. It was designed to test both:

  • Depth of reasoning — world-class mathematical and scientific problems
  • Breadth of knowledge — questions spanning over 100 subject areas

Key Characteristics

| Feature | Details |
| --- | --- |
| Total questions | 2,500 (public) + private held-out set |
| Subjects covered | 100+ across math, humanities, natural sciences |
| Question types | Multiple-choice (24%) and short-answer (76%) |
| Multi-modal | 14% of questions require understanding images/diagrams |
| Grading | Automated (closed-form, unambiguous answers) |
| Anti-contamination | Private test set to detect overfitting; canary strings |

What Makes It So Hard?

Every question in HLE was:

  1. Created by subject-matter experts — nearly 1,000 contributors across 500+ institutions in 50+ countries (professors, researchers, PhD holders)
  2. Required to stump frontier LLMs — a question only passed the initial bar if models could not answer it correctly
  3. Manually reviewed by expert reviewers with graduate degrees in relevant fields
  4. Verified unsearchable — questions that could be easily answered via web search were removed

The dataset started with over 70,000 submissions. Only 13,000 passed the LLM difficulty filter. After expert human review, a finalized set of 2,500 public questions remained.

graph TD
    A["70,000+ submissions<br/>from global experts"] --> B["13,000 passed<br/>LLM difficulty filter"]
    B --> C["Expert human review<br/>(graduate-level reviewers)"]
    C --> D["2,700 accepted"]
    D --> E["Remove searchable<br/>& flagged questions"]
    E --> F["2,500 finalized<br/>public questions"]

    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333

Who Built It?

HLE was developed by the Center for AI Safety (CAIS) and Scale AI, with lead authors:

  • Long Phan, Nathaniel Li, Adam Khoja, Richard Ren — Center for AI Safety
  • Alice Gatti, Ziwen Han, Josephina Hu, Hugh Zhang — Scale AI
  • Summer Yue, Alexandr Wang — Scale AI (senior leads)
  • Dan Hendrycks — Center for AI Safety (senior lead)

Contributors competed for a $500,000 USD prize pool ($5,000 for each of the top 50 questions, $500 for the next 500 questions), along with optional co-authorship.

Publication

HLE was published in Nature (Nature 649, 1139–1146, January 2026), one of the most prestigious scientific journals, underscoring its significance to the research community.

| Resource | Link |
| --- | --- |
| Nature paper | nature.com/articles/s41586-025-09962-4 |
| arXiv preprint | arxiv.org/abs/2501.14249 |
| Website | lastexam.ai |
| GitHub | github.com/centerforaisafety/hle |

What Skills Does It Test?

Unlike narrowly focused benchmarks, HLE tests a broad spectrum of expert-level academic capabilities:

graph TD
    HLE["Humanity's Last Exam<br/>2,500 questions"] --> M["Mathematics<br/>& Logic"]
    HLE --> S["Natural Sciences<br/>Physics, Chemistry, Biology"]
    HLE --> H["Humanities<br/>History, Classics, Philosophy"]
    HLE --> CS["Computer Science<br/>& Engineering"]
    HLE --> Med["Medicine<br/>& Life Sciences"]
    HLE --> Other["Other Disciplines<br/>Economics, Law, Linguistics..."]

    style HLE fill:#e74c3c,color:#fff,stroke:#333
    style M fill:#3498db,color:#fff,stroke:#333
    style S fill:#27ae60,color:#fff,stroke:#333
    style H fill:#f39c12,color:#fff,stroke:#333
    style CS fill:#8e44ad,color:#fff,stroke:#333
    style Med fill:#e67e22,color:#fff,stroke:#333
    style Other fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What HLE Tests |
| --- | --- |
| Deep reasoning | Multi-step mathematical proofs, complex derivations |
| Expert knowledge | Cutting-edge scientific facts, obscure domain knowledge |
| Multi-modal understanding | Questions with diagrams, inscriptions, chemical structures |
| Calibration | Whether models know what they don’t know (confidence estimation) |
| Resistance to search | Knowledge that cannot be trivially retrieved via internet search |

Example Questions

HLE questions span extraordinary breadth — from translating Palmyrene script on Roman tombstones (Classics) to identifying the number of paired tendons supported by a hummingbird’s sesamoid bone (Ecology/Anatomy). This diversity is what makes HLE uniquely challenging.

Current Leaderboard

The leaderboard below shows model accuracy on HLE as published on the SEAL LLM Leaderboard by Scale AI. Rankings use Rank (Upper Bound): a model’s rank is 1 plus the number of models whose lower confidence-interval bound exceeds its upper bound, so reported rank differences are statistically meaningful.

Source: SEAL LLM Leaderboard — Humanity’s Last Exam (consulted March 28, 2026). Dataset updated April 3, 2025, with finalized 2,500 questions. Judge model: o3-mini.

| Rank | Model | Accuracy (%) | Calibration Error |
| --- | --- | --- | --- |
| 1 | GPT-5.4 Pro | 44.32 ± 1.95 | 38 |
| 2 | Gemini 3 Pro Preview | 37.52 ± 1.90 | 57 |
| 2 | GPT-5.4 (xhigh thinking) | 36.24 ± 1.88 | 42 |
| 2 | Claude Opus 4.6 (thinking max) | 34.44 ± 1.86 | 46 |
| 4 | GPT-5 Pro | 31.64 ± 1.82 | 49 |
| 6 | GPT-5.2 | 27.80 ± 1.76 | 45 |
| 6 | GPT-5 | 25.32 ± 1.70 | 50 |
| 6 | Claude Opus 4.5 (thinking) | 25.20 ± 1.70 | 55 |
| 6 | Kimi K2.5 | 24.37 ± 1.81 | 67 |
| 7 | GPT-5.1 (thinking) | 23.68 ± 1.67 | 55 |
| 11 | Gemini 2.5 Pro (Jun 05) | 21.64 ± 1.61 | 72 |
| 11 | o3 (high) | 20.32 ± 1.58 | 34 |
| 11 | GPT-5 Mini | 19.44 ± 1.55 | 65 |
| 11 | o3 (medium) | 19.20 ± 1.54 | 39 |
| 11 | Claude Opus 4.6 (non-thinking) | 19.00 ± 1.54 | 44 |
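The Rank (Upper Bound) rule described above can be sketched in a few lines of Python. The scores below are the top five rows of the table; the exact tie-handling used by the SEAL leaderboard is an assumption here:

```python
# Rank (Upper Bound): 1 + number of models whose lower confidence-interval
# bound exceeds this model's upper bound. Scores are (accuracy, ±margin).
models = {
    "GPT-5.4 Pro": (44.32, 1.95),
    "Gemini 3 Pro Preview": (37.52, 1.90),
    "GPT-5.4 (xhigh thinking)": (36.24, 1.88),
    "Claude Opus 4.6 (thinking max)": (34.44, 1.86),
    "GPT-5 Pro": (31.64, 1.82),
}

def rank_upper_bound(scores: dict[str, tuple[float, float]]) -> dict[str, int]:
    """For each model, count models whose CI lies entirely above its CI."""
    ranks = {}
    for name, (acc, margin) in scores.items():
        upper = acc + margin
        strictly_better = sum(
            1 for other, (o_acc, o_margin) in scores.items()
            if other != name and o_acc - o_margin > upper
        )
        ranks[name] = 1 + strictly_better
    return ranks

print(rank_upper_bound(models))
```

Run on these five rows, the rule reproduces the table’s ranks 1, 2, 2, 2, 4: the three models ranked 2 have overlapping confidence intervals, so none can be declared statistically better than the others.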

Key takeaway: Even the best frontier model (GPT-5.4 Pro) scores only 44.32% — meaning more than half the questions remain unsolved. Most models exhibit high calibration errors, indicating systematic overconfidence.

For the full, up-to-date leaderboard, visit the links in the next section.

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| SEAL LLM Leaderboard | Scale AI’s official leaderboard with confidence intervals and calibration | labs.scale.com/leaderboard/humanitys_last_exam |
| CAIS AI Dashboard | Center for AI Safety’s dashboard with HLE-Rolling live submission | agi.safe.ai/dashboard |
| HLE Website | Official website with paper, results, and progress chart | lastexam.ai |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| Hugging Face Dataset | The full 2,500-question dataset (requires access agreement) | huggingface.co/datasets/cais/hle |
| GitHub Repository | Evaluation code, prompts, and documentation | github.com/centerforaisafety/hle |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2501.14249 |
| Nature Publication | Peer-reviewed publication | nature.com/articles/s41586-025-09962-4 |

Load the Dataset

from datasets import load_dataset

# The dataset is gated: accept the access agreement on Hugging Face,
# then authenticate (e.g. `huggingface-cli login`) before loading.
dataset = load_dataset("cais/hle", split="test")

HLE-Rolling

In October 2025, the team released HLE-Rolling — a dynamic, evolving fork of the benchmark that accepts new contributions over time. This ensures HLE remains relevant as models improve.

Understanding the Metrics

Accuracy

The primary metric. Models answer each question, and an automated judge (o3-mini) compares each response against the ground-truth answer. Because answers are closed-form and unambiguous, the judge only has to check equivalence with a single correct answer rather than assess open-ended quality.
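As a toy illustration of closed-form grading, the sketch below scores responses by normalized exact match. The normalization is a hypothetical stand-in: the real benchmark uses an LLM judge (o3-mini) precisely because free-form answers vary in formatting more than simple string rules can handle:

```python
def normalize(answer: str) -> str:
    """Crude normalization: case, surrounding whitespace, trailing period."""
    return " ".join(answer.strip().lower().rstrip(".").split())

def grade(responses: list[str], ground_truth: list[str]) -> float:
    """Fraction of responses that exactly match after normalization."""
    correct = sum(
        normalize(r) == normalize(t) for r, t in zip(responses, ground_truth)
    )
    return correct / len(ground_truth)

# "  Paris." matches "paris"; "blue" matches "Blue"; "4" does not match "5".
print(grade(["  Paris.", "4", "blue"], ["paris", "5", "Blue"]))  # 2 of 3 correct
```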

Calibration Error

Models are prompted to provide both an answer and a confidence score (0–100%). Calibration error measures the gap between stated confidence and actual accuracy.

| Scenario | Confidence | Accuracy | Calibration |
| --- | --- | --- | --- |
| Well-calibrated | 50% | 50% | Good |
| Overconfident | 85% | 10% | Bad (CE: 75+) |
| Current frontier models | 60–90% | 5–45% | Bad (CE: 34–89) |
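The gap between confidence and accuracy can be quantified with a binned calibration error. The sketch below computes the common expected calibration error (ECE); it illustrates the idea but is not necessarily the exact formula used by the HLE grader:

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted average of |mean confidence - accuracy|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [
            (c, ok) for c, ok in zip(confidences, correct)
            if lo < c <= hi or (b == 0 and c == 0.0)
        ]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - accuracy)
    return ece

# A model that is 90% confident but right only 25% of the time:
print(expected_calibration_error([0.9] * 4, [True, False, False, False]))  # ≈ 0.65
```

This matches the "Overconfident" row above: high stated confidence with low accuracy yields a large calibration error.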

Key insight: Most frontier models are systematically overconfident on HLE — they express high confidence even when wrong. This is strong evidence of confabulation/hallucination. The o3 model family shows the best calibration (CE: 34–39), while older models like GPT-4o exhibit calibration errors of 89.

Why HLE Matters

graph LR
    A["Benchmark<br/>Saturation"] --> B["Cannot distinguish<br/>frontier models"]
    B --> C["HLE fills the gap"]
    C --> D["Informed AI policy<br/>& research"]

    A2["Overconfident<br/>models"] --> B2["Calibration errors<br/>not flagged"]
    B2 --> C
    C --> D2["Better safety<br/>assessments"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Measures what matters — Expert-level academic reasoning, not just pattern matching
  2. Resists saturation — Even the best models score < 50%
  3. Exposes overconfidence — Calibration metrics reveal when models are hallucinating
  4. Informs policy — Provides a common reference point for scientists and policymakers
  5. Anti-contamination — Private held-out set detects overfitting to the public dataset

Video: Humanity’s Last Exam Explained

Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀

Conclusion

Humanity’s Last Exam represents a milestone in AI evaluation:

  • 2,500 expert-crafted questions across 100+ subjects that frontier LLMs still largely cannot solve
  • Built by ~1,000 subject-matter experts from 500+ institutions across 50+ countries
  • Published in Nature — peer-reviewed and validated by the scientific community
  • The best model scores 44% — vast room for improvement remains
  • Calibration errors reveal that models don’t know what they don’t know

As AI capabilities advance, HLE provides a meaningful yardstick for measuring genuine progress — not just incremental improvements on already-saturated benchmarks. When models eventually achieve high accuracy on HLE, it will signal a profound leap in AI’s ability to match expert human knowledge on closed-ended academic questions.

But as the authors note: “HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.”
